Automated web scraping using Obsidian Web Clipper and Puppeteer.

Obsidian is a powerful note-taking system. Using Puppeteer and the Obsidian Web Clipper, you can automatically download web content into your vault!

This is a slightly unorthodox way of sharing my process for scraping websites with Obsidian, but the project was rather disjointed, so I figured it would be easier to explain in narrative form what each piece is intended to do.

Obsidian Web Clipper

The official Obsidian Web Clipper is amazing. The scraper I built uses this Chrome extension to handle parsing pages and sending the data to Obsidian. You can use it for lots of cool and interesting things.

Once you have the extension installed, you can go here to access a collection of templates. These give you a feel for what the extension can do and how it works. For my recipe scraper, I customized a few pieces of the recipe template included in the link above.

Problem: Automation

Once I had set up the template the way I wanted, I was ready to start scraping! Unfortunately, the clipper is intentionally designed to be activated manually; there is no built-in way to trigger it automatically. Because I wanted to scrape thousands of recipes, manually navigating to each page and triggering the extension would be far too much work.

I will spare you the long and boring details of how I figured this out. The short version: the only way I found to trigger the extension automatically was to build my own Chrome extension and connect it to a local copy of the Obsidian extension that I customized to allow external connections.

Chrome extension for auto trigger

The code for the custom Chrome extension can be found in this repo.

Steps to install it on chrome:

  1. Clone the project
  2. Open the Chrome extension manager and enable developer mode
  3. Once enabled, click 'Load unpacked' and select the parent folder (the one containing manifest.json)

The extension basically fires a message as soon as you open a URL that matches one of the patterns in the 'matches' section of the manifest.json file.

manifest.json
"content_scripts": [
	{
		"js": ["scripts/content.js"],
		"matches": [
			"https://developer.chrome.com/docs/extensions/*",
			"https://developer.chrome.com/docs/webstore/*"
		]
	}
]
Update the "matches" array to include any URLs you want to scrape (manifest.json must be valid JSON, so keep comments out of it). Note that the content script fires EVERY time a matching page loads, whether you have already scraped it or not.
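To see how those patterns select pages, here is a hypothetical helper (not part of either extension, and simplified relative to Chrome's real match-pattern rules) that treats each `*` in a pattern as a wildcard:

```javascript
// Hypothetical sketch: roughly how a "matches" pattern selects URLs.
// Chrome's actual match-pattern grammar is stricter; this only handles
// the simple "prefix/*" style used in the manifest above.
function matchesPattern(url, pattern) {
  // Escape regex metacharacters (except "*"), then turn "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
  return regex.test(url);
}

const patterns = [
  "https://developer.chrome.com/docs/extensions/*",
  "https://developer.chrome.com/docs/webstore/*",
];
const pageUrl = "https://developer.chrome.com/docs/extensions/get-started";
console.log(patterns.some((p) => matchesPattern(pageUrl, p))); // true
```

Any page whose URL matches one of these patterns gets the content script injected, which is what kicks off the whole trigger chain.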

You will also want to make note of the id at the top of the background.js file. You will need to set it to the id of the local Obsidian Web Clipper that you 'customize' below.

background.js
// This needs to be set to the id of the extension you want to trigger
const id = 'lemfefnbebfkbajcafkjoklibjadafasg';
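Beyond that id line, the rest of background.js isn't shown here. As a sketch (the function and message names are assumptions, not the repo's exact code), the forwarding logic amounts to sending an external message to that extension id whenever the content script reports a matching page:

```javascript
// Sketch only: how the trigger extension's background script might
// forward the signal. The target `id` is the extension id of the
// customized Web Clipper.
const id = 'lemfefnbebfkbajcafkjoklibjadafasg';

// Hypothetical message builder; the action name must match what the
// customized clipper's listener checks for.
function buildTrigger() {
  return { action: 'triggerQuickClip' };
}

// chrome.* only exists inside the browser, so guard for it here.
if (typeof chrome !== 'undefined' && chrome.runtime) {
  chrome.runtime.onMessage.addListener(() => {
    // sendMessage(extensionId, message) delivers a cross-extension
    // message, which the clipper receives via onMessageExternal.
    chrome.runtime.sendMessage(id, buildTrigger());
  });
}
```

The key API here is the two-argument form of chrome.runtime.sendMessage: passing an extension id as the first argument routes the message to another extension instead of your own.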

Obsidian customizations

Obsidian uses the browser commands and messages APIs to trigger the plugin whenever you hit the right keyboard shortcut. You can hook into this without changing much code, which lets the Web Clipper receive the trigger command from our custom extension above.

All you need to do is add the following code snippet at line 183 of the src/background.ts file.

src/background.ts
browser.runtime.onMessageExternal.addListener(
	(
		request: unknown,
		sender: browser.Runtime.MessageSender,
		sendResponse: (response?: any) => void
	): true | undefined => {
		if (typeof request === "object" && request !== null) {
			const typedRequest = request as {
				action: string;
				isActive?: boolean;
				hasHighlights?: boolean;
				tabId?: number;
			};
			if (typedRequest.action === "triggerQuickClip") {
				browser.tabs
					.query({ active: true, currentWindow: true })
					.then((tabs) => {
						if (tabs[0]?.id) {
							browser.action.openPopup();
							setTimeout(() => {
								browser.runtime
									.sendMessage({ action: "triggerQuickClip" })
									.catch((error) =>
										sendResponse({ success: error })
									);
							}, 500);
						}
					});
				return true;
			}
			// For other actions that use sendResponse
			if (
				typedRequest.action === "extractContent" ||
				typedRequest.action === "ensureContentScriptLoaded" ||
				typedRequest.action === "getHighlighterMode" ||
				typedRequest.action === "toggleHighlighterMode"
			) {
				return true;
			}
		}
		return undefined;
	}
);

Once this has been changed, recompile the Obsidian extension and load your new custom version into Chrome as well. It may be useful to remove or disable the original Obsidian Web Clipper extension while you do this.

Templates

You will need to load your custom templates and settings into the 'custom' web clipper. The easiest approach is to export all settings from the 'real' web clipper and import them into the one you built.

After installing the custom web clipper, be sure to update the id in the trigger extension we installed earlier so it sends the trigger to the right place.

If you have done everything correctly, you should be able to visit any web page that matches your trigger extension's rules and have it automatically downloaded into Obsidian!

Final steps

Now that all of that was working, I needed some way to visit each site, pause for a moment to let Obsidian 'scrape' the page, and then move on. I created a huge array of URLs I wanted to scrape and used a short Puppeteer script to manage this relatively easily.
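As a rough sketch of how such an array might be prepared (this helper is an assumption, not code from the project), it helps to normalize each URL and drop duplicates before scraping, since the clipper will happily re-clip the same page:

```javascript
// Hypothetical sketch: build a deduplicated, normalized list of links
// like the `cleanedLinks` array used in the script below.
function cleanLinks(links) {
  const seen = new Set();
  const cleaned = [];
  for (const link of links) {
    try {
      const url = new URL(link.trim());
      url.hash = ''; // fragments don't change the page content
      const normalized = url.toString();
      if (!seen.has(normalized)) {
        seen.add(normalized);
        cleaned.push(normalized);
      }
    } catch {
      // Skip anything that isn't a valid absolute URL.
    }
  }
  return cleaned;
}

const cleanedLinks = cleanLinks([
  'https://example.com/recipe-1',
  'https://example.com/recipe-1#comments',
  'not a url',
]);
console.log(cleanedLinks.length); // 1
```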

index.js

const puppeteer = require('puppeteer');

// Replace this with the full list of URLs you want to scrape.
const cleanedLinks = ['https://example.com/recipe-1'];

const asyncWait = async (time) => {
	await new Promise((resolve) => setTimeout(resolve, time));
};

(async () => {
	// Launch Chrome with remote debugging enabled before running this script:
	// MAC: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check --user-data-dir=$(mktemp -d -t 'chrome-remote_data_dir')
	// PC: start chrome.exe --remote-debugging-port=9222

	// Note: this url changes each time the command is run.
	const wsChromeEndpointUrl = 'YOUR_URL_HERE';
	const browser = await puppeteer.connect({
		browserWSEndpoint: wsChromeEndpointUrl,
	});

	for (let i = 0; i < cleanedLinks.length; i++) {
		const page = await browser.newPage();
		console.log(`${i}: ${cleanedLinks[i]}`);
		await page.goto(cleanedLinks[i], {
			waitUntil: 'domcontentloaded',
		});
		// Give the clipper time to capture the page before moving on.
		await asyncWait(2000);
		await asyncWait(Math.random() * 1000);
		await page.close();
	}

	// Disconnect without closing the Chrome instance itself.
	browser.disconnect();
})();
Headless Puppeteer will not work

You have to connect Puppeteer to an existing instance of Chrome (one with your extensions loaded); that is what the wsChromeEndpointUrl points to.

If you run the command from a terminal, it should open a new Chrome instance, and the URL you need will be in the terminal output. (Make sure not to close the Chrome instance the terminal opened until you are done, as the URL resets each time you open a new one.)

Install both custom extensions into the newly launched Chrome instance, run the Puppeteer script, and watch as your scraper begins loading content into your vault automatically!